Llama 3.2-Vision is a series of multimodal large language models developed by Meta, available at 11B and 90B parameter scales. The models take image-plus-text input and produce text output, and are optimized for visual recognition, image reasoning, image captioning, and visual question answering.
Multimodal
Transformers
Multiple Languages
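
For context, a minimal sketch of running the model for image captioning with the Hugging Face Transformers Mllama integration (transformers >= 4.45); the 11B Instruct model id, the image path, and the prompt are illustrative assumptions, not part of the card above:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit the 11B weights on one GPU
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # hypothetical local image file

# Interleave the image with a text prompt via the chat template.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same pattern covers the other listed tasks (visual question answering, image reasoning) by changing the text portion of the user message.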